A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations

Authors

  • Azzam Haidar
  • Tingxing Dong
  • Stanimire Tomov
  • Piotr Luszczek
  • Jack J. Dongarra
Abstract

As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. Hybrid CPU-GPU algorithms, by contrast, rely heavily on the multicore CPU for specific parts of the workload. To benefit from the GPU's significantly higher energy efficiency, our primary design goal is to avoid the multicore CPU and rely exclusively on the GPU, which also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single streaming multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorization to achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5× speedup on the K40 GPU.
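
To make the idea of a batched factorization concrete, here is a minimal CPU-side sketch in NumPy that applies an unblocked Householder QR to a whole batch of small matrices at once. It is an illustration only, not the paper's GPU implementation, which is described as a sequence of batched BLAS routines using block Householder transformations. The function name `batched_householder_qr` and the array layout `(batch, m, n)` are assumptions made for this example.

```python
import numpy as np

def batched_householder_qr(A):
    """Unblocked Householder QR over a batch (illustrative sketch only).

    A has shape (batch, m, n) with m >= n.
    Returns Q of shape (batch, m, m) and R of shape (batch, m, n)
    such that Q @ R reproduces A for every matrix in the batch.
    """
    A = np.asarray(A, dtype=np.float64)
    batch, m, n = A.shape
    R = A.copy()
    Q = np.broadcast_to(np.eye(m), (batch, m, m)).copy()
    for k in range(n):
        # Column k from the diagonal down, for every matrix in the batch.
        x = R[:, k:, k]
        norm_x = np.linalg.norm(x, axis=1)
        # Pick the sign that avoids cancellation when forming v.
        sign = np.where(x[:, 0] >= 0, 1.0, -1.0)
        v = x.copy()
        v[:, 0] += sign * norm_x
        v_norm = np.linalg.norm(v, axis=1, keepdims=True)
        # A zero column yields v = 0, i.e. the identity reflector.
        v = np.divide(v, v_norm, out=np.zeros_like(v), where=v_norm > 0)
        # Apply H = I - 2 v v^T from the left to the trailing rows of R.
        w = np.einsum('bi,bij->bj', v, R[:, k:, :])
        R[:, k:, :] -= 2.0 * np.einsum('bi,bj->bij', v, w)
        # Accumulate Q <- Q H on the trailing columns of Q.
        u = np.einsum('bij,bj->bi', Q[:, :, k:], v)
        Q[:, :, k:] -= 2.0 * np.einsum('bi,bj->bij', u, v)
    return Q, R
```

Each loop iteration processes the same column step for all matrices in the batch together, which is the structural property a batched GPU kernel would exploit. A blocked variant would instead aggregate several reflectors before updating the trailing matrix.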

Similar resources

A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification, or clustering method. Methods based on the alternating least squares (ALS) approach are usually used to solve this non-convex minimization problem. At each step of ALS algorithms, two convex least squares problems must be solved, which causes high com...

High Performance Algorithms for Toeplitz and Block Toeplitz Matrices

In this paper, we present several high performance variants of the classical Schur algorithm to factor various Toeplitz matrices. For positive definite block Toeplitz matrices, we show how hyperbolic Householder transformations may be blocked to yield a block Schur algorithm. This algorithm uses BLAS3 primitives and makes efficient use of a memory hierarchy. We present three algorithms for inde...

Parallel Implementation of Particle Swarm Optimization Variants Using Graphics Processing Unit Platform

There are different variants of the Particle Swarm Optimization (PSO) algorithm, such as Adaptive Particle Swarm Optimization (APSO) and Particle Swarm Optimization with an Aging Leader and Challengers (ALC-PSO). These algorithms improve the performance of PSO in terms of finding the best solution and accelerating convergence. However, these algorithms are computationally intensive. The go...

Orthonormal integrators based on Householder and Givens transformations

We consider refined implementations of algorithms based on Householder and Givens transformations to find the Q-factor in the QR factorization of a matrix solution of linear time-dependent differential systems. After discussing the algorithms, we introduce a suite of integrators, QRINT, and provide numerical testing to show the efficiency and accuracy of our techniques.

A New Much Faster and Simpler Algorithm for Lapack Dgels

We present new algorithms for computing the linear least squares solution to overdetermined linear systems and the minimum norm solution to underdetermined linear systems. For both problems, we consider the standard formulation $\min \|AX - B\|_F$ and the transposed formulation $\min \|A^T X - B\|_F$, i.e., four different problems in all. The functionality of our implementation corresponds to that of the L...

Publication date: 2015